Instructors

Gabriela Palomo gabriella.palomo@gmail.com

  • https://twitter.com/gabbspalomo
  • Postdoc at the University of Maryland
  • Carnivore ecologist but nowadays I am more of a Data Scientist

Hannah Griebling
hgriebli@mail.ubc.ca

  • PhD Candidate at University of British Columbia
  • Cognitive ecologist: focus on urban raccoon cognition and behavior

Learning objectives

  • After today’s lecture, you’ll be able to:

    • Understand the structure of tidy data
    • Understand the main tidy verbs in dplyr to help tidy data
    • Organize and clean data downloaded from UWIN to run a single species single season occupancy analysis

Organize the project and directory

  • I highly recommend you use RStudio Projects instead of setwd()

  • Go to RStudio and click on File > New Project.

Now you see three options:

  • New directory: choose this option if you want to create a folder that will contain all the subdirectories and files of this particular project.
  • Existing directory: use this option if you already created a folder which will contain all the subdirectories and files for this particular project. Choose that folder here.
  • Version Control: choose this option if you are going to work with a repository already stored in GitHub.

For our own project, let’s go ahead and choose New Directory and let’s name our project: 2024-data-manipulation-UWIN.

Other files inside the main directory

You will have a series of directories inside your project, depending on the type of work that you’ll be working on. Some people recommend following the same structure that you would use if creating an r package. However, I think that at a minimum, you could have the following structure:

Other files inside the main directory

  • Data is a directory that has all your original .csv files with the data that you will use in your analysis.
  • Functions is a directory that houses all the functions you create and that you will be using throughout your analysis. Some people include this directory as a subdirectory of R.
  • Plots is a directory in which you will put all the graphs you create as part of your analysis.
  • R is a directory that will have all the scripts needed for your analysis.
  • Results is a directory that you may or may not need. The idea is to include all the resulting .csv or .rds files in here and keep them separate from your original files.
  • You may need other directories, especially if you are working with spatial data, for example, shapefiles, rasters, maps, etc.

Naming files

  • File names should be machine readable: avoid spaces, symbols, and special characters. Don’t rely on case sensitivity to distinguish files.

  • File names should be human readable: use file names to describe what’s in the file.

  • File names should play well with default ordering: start file names with numbers so that alphabetical sorting puts them in the order they get used.

Examples:

  • Document 1.docx
  • manuscript_final.docx
  • final_document_final.qmd
  • data.csv
  • 2024_05_03_manuscript_name.R
  • 01_data_cleaning.R
  • 02_model.R
  • fig-01.png
  • exercise-uwin-workshop.qmd

Why are these good names? Well because if you have several of those, you can arrange them by date (descending or ascending), or by order of fig-01, fig-02.

Warning

It’s important to note that fig-01.png is not the same as fig-1.png because your computer will read the following files in this order: fig1.png, fig10.png, fig11.png, fig2.png.

Let’s talk about pipes

  • At the beginning there was only one pipe operator, %>%, which is from the magrittr package.

  • The idea is to have a way to pipe an object forward into a function or call expression.

  • It should be read as ‘then’. For example: The following code is read as follows: start with object df THEN select col1.

df %>% select(col1)

Native pipe in base R

  • Now, base R has it’s own pipe called native pipe, |>, which is also read as ‘then’.

  • You can activate this native pipe by going to Tools > Global options > Code and selecting that option.

  • You can read more about the differences between both pipes here.

dplyr verbs: data transformation

  • dplyr is a package based on a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

    • mutate() adds new variables that are functions of existing variables
    • select() picks variables based on their names
    • filter() picks cases based on their values
    • summarise() reduces multiple values down to a single summary
    • arrange() changes the ordering of the rows
    • group_by() groups variables for you to perform operations on the grouped data. Always remember to ungroup() once you are finished
  • These can be linked together by pipes |> or %>%

  • Cool cheatsheet for dplyr

tidyr for tidying data

  • The tidyr package has a series of functions that are named after verbs that will help you tidy and clean data.

  • The goal of tidyr is to help you create tidy data. Tidy data is data where:

    • Each variable is a column; each column is a variable

    • Each observation is a row; each row is an observation

    • Each value is a cell; each cell is a single value

  • Cool cheatsheet for tidyr